Explore and Summarize Data

I have chosen the Red Wine Quality dataset. You can download the data using this link.

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Attribute information:

Input variables (based on physicochemical tests):
  • fixed acidity (tartaric acid - g / dm^3)

  • volatile acidity (acetic acid - g / dm^3)

  • citric acid (g / dm^3)

  • residual sugar (g / dm^3)

  • chlorides (sodium chloride - g / dm^3)

  • free sulfur dioxide (mg / dm^3)

  • total sulfur dioxide (mg / dm^3)

  • density (g / cm^3)

  • pH

  • sulphates (potassium sulphate - g / dm^3)

  • alcohol (% by volume)

Output variable (based on sensory data):
  • quality (score between 0 and 10)

Overview of the data

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Variable X is only a row index for the dataset, so let’s remove it and then look at a general summary of the data.
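Dropping the index column might look like the sketch below. It uses only the first three rows and three of the columns printed in the overview, and the object name `wine` is taken from the later output; whether the original analysis used this exact call is an assumption.

```r
# First three rows of a few columns, as printed in the overview above.
wine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  quality = c(5L, 5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)    # X is no longer listed
summary(wine)  # per-column minimum, quartiles, mean and maximum
```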

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Let’s look at the data.

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Univariate Plots Section

Let’s plot histograms of the different variables. Since quality takes only a few integer values, let’s convert it to a factor and plot a histogram.
I have also added a new variable to the data, wine_quality, which labels each wine as low, average or high. The mapping is: 0-4 low; 5 and 6 average; 7 and above high.
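A minimal sketch of how such a grouping could be built with `cut()`, assuming that scores of 7 and above count as high (the later comparison of qualities 7 and 8 suggests this). The break points are an illustration, not necessarily the original code.

```r
# The six quality scores that actually occur in the data (3 to 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Intervals are closed on the right: (-Inf, 4], (4, 6], (6, Inf).
wine_quality <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "average", "high"),
  ordered_result = TRUE
)

data.frame(quality, wine_quality)
```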

This plot shows the count of wines at each quality level. Let’s have a look at the density variable in the dataset. This distribution is approximately normal, so there is no need to transform it.

The distribution of fixed acidity is right skewed. Let’s take a log transformation and see if we can fix it.
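To make the effect of the transformation concrete, the sketch below compares a simple moment-based skewness before and after taking log10. It uses only the ten fixed.acidity values printed in the overview, and the `skewness` helper is a hypothetical function written for this illustration.

```r
# First ten fixed.acidity values from the overview above.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Moment-based sample skewness: third central moment over variance^1.5.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(fixed.acidity)
skew_log <- skewness(log10(fixed.acidity))
c(raw = skew_raw, log10 = skew_log)
```

On such a small sample the reduction is modest; the histogram of the full data is the better guide.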

This appears more reasonable now, with most values concentrated between 7 and 9 on the fixed.acidity scale.
Now let’s have a look at volatile.acidity.

It is unclear whether the distribution of volatile acidity is bimodal or unimodal, right skewed or normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The summary above is for citric.acid, and it’s not clear what distribution this variable follows. In the first plot it appears to be bimodal, with peaks at 0 and 0.5, and taking a log transformation (second plot) doesn’t help either. Let’s take a look at its box plot.
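A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch, the fences can be computed from the rounded quartiles in the summary above, so the result is approximate.

```r
# Rounded values taken from the summary printed above.
q1 <- 0.090
q3 <- 0.420
x_max <- 1.000

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr    # points above this are drawn as outliers

c(lower = lower_fence, upper = upper_fence)
x_max > upper_fence              # is the maximum beyond the upper fence?
```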

The boxplot shows the median to be just above 0.25, with nearly every point within 1.5 times the interquartile range of the box. Now let’s look at the residual sugar histogram.

This distribution is also right skewed. Let’s remove the top 5% of the data, take a log transformation, and see if that fixes it.

Now this distribution looks normal. Similarly, let’s look at the distribution of the chlorides variable.
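Trimming the top 5% before the log transformation could be sketched as follows, using the ten residual.sugar values printed in the overview. The choice of `quantile()` with its default interpolation is an assumption about how the cutoff was found.

```r
# First ten residual.sugar values from the overview above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Keep only values at or below the 95th percentile.
cutoff <- quantile(residual.sugar, 0.95)
trimmed <- residual.sugar[residual.sugar <= cutoff]

# Log-transform what is left to compress the remaining right tail.
log_sugar <- log10(trimmed)

range(trimmed)
```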

The alcohol content can be another important consideration when we are purchasing wine:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

It looks like the alcohol content of the wine in the dataset follows a lognormal distribution with a high peak at the lower end of the alcohol scale.
Let’s have a look at pH levels.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 11 features on the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol), plus the quality rating.

Other observations:

The median quality is 6. Most wines have a pH of 3.2 or higher. About 75% of wines have a quality of 6 or lower. The median percent alcohol content is 10.20 and the maximum is 14.90.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I found that citric acid has an unusual distribution in the dataset. Since the data was already tidy, I made no modifications beyond dropping the index column X.

Are there any new variables created?

Yes, I created a new variable, wine_quality, to reduce the number of groups when plotting different features against the quality level of wine. It categorises 0-4 as low quality, 5 and 6 as average quality, and 7 and above as high quality.

Bivariate Plots Section

We can quickly visualize the relationship between each pair of variables and find their Pearson product-moment correlation.

From the plot, we can see that the three variables most positively correlated with quality are alcohol, sulphates and citric.acid.
The most negatively correlated variables are volatile.acidity, total.sulfur.dioxide and density. This seems reasonable, since most acids in wine are fixed acids. Let’s look at a few of these relationships in a bit more detail.
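The pairwise coefficients behind such a plot can be computed with `cor()`. The sketch below uses only the six rows shown earlier and four of the columns, so the numbers will differ from those for the full dataset; the name `wine_head` is made up for this illustration.

```r
# The six rows printed earlier, restricted to four columns.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation for every pair of columns.
cor_matrix <- cor(wine_head, method = "pearson")
round(cor_matrix, 2)
```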

Density and alcohol

We see that density tends to increase with decreasing alcohol content. Let’s look at the correlation between the two and check if it’s true.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

This verifies the plot.
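As a consistency check on the output above, the t statistic of a Pearson correlation test follows from the coefficient and the degrees of freedom as t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the reported values.

```r
# Values reported in the test output above.
r <- -0.4961798
df <- 1597

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 3)   # agrees with the reported t = -22.838
p_value < 2.2e-16  # agrees with the reported p-value bound
```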

Quality and alcohol

It looks like the red wines with a higher alcohol content tend to have a higher quality rating…what a surprise!

## factor(wine$quality): 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## factor(wine$quality): 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## factor(wine$quality): 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## factor(wine$quality): 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## factor(wine$quality): 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## factor(wine$quality): 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The summaries support this: wines rated 7 and 8 have median alcohol contents of 11.50 and 12.15, higher than the medians of 9.70 to 10.50 for the lower ratings.
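Per-group summaries in this format come from `by()`. The sketch below applies it to the first ten wines printed in the overview, so only qualities 5, 6 and 7 appear and the figures differ from the full-data output; `wine_head` is a name made up here.

```r
# First ten alcohol and quality values from the overview above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Six-number summary of alcohol within each quality level.
by(wine_head$alcohol, factor(wine_head$quality), summary)

# Median alcohol per quality level, for a quick comparison.
median_by_quality <- tapply(wine_head$alcohol, wine_head$quality, median)
median_by_quality
```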

Quality and volatile.acidity

The graph shows a very clear trend: the lower the volatile acidity, the higher the quality. The correlation coefficient between quality and volatile acidity is -0.39. This can be explained by the fact that volatile acidity at too high a level can lead to an unpleasant, vinegar taste.

Sulphates and quality show a weak positive relationship: the higher the sulphates, the higher the quality tends to be.

Density and Quality

There is no strong general trend here, but the plot suggests that quality increases as density decreases. I am not sure that should be true.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I observed a negative relationship between quality level and volatile acidity, and a positive correlation between quality level and alcohol. I am not surprised at this result, because people tend to grade stronger wines as high quality, whereas wines with low percent alcohol are often not graded as such. High volatile acidity is also perceived to be undesirable because it impacts the taste of wines. Alcohol and volatile acidity don’t have any clear relationship with each other.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, I observed a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity. Other variables show either a very weak relationship or none at all.

What was the strongest relationship found?

With quality, alcohol is positively related whereas volatile.acidity is negatively related. I also observed a positive relationship between density and fixed acidity and a negative one between pH and fixed acidity. Other features of interest show weak relationships.

Multivariate Plot Section

Now let’s visualise the relationship between volatile.acidity, alcohol and quality.

The plot shows that higher quality wines are concentrated in the top left corner, which corresponds to lower volatile.acidity and higher alcohol, in line with the analysis above.

Now let’s analyze sulphate levels and alcohol with respect to quality.

This shows that higher quality red wines are generally located near the upper right of the scatter plot (darker contour lines), whereas lower quality red wines are generally located in the bottom right.

Let’s visualise wine_quality variable created with other factors.

The densities of high quality wines are concentrated between 0.994 and 0.998, and in the lower part of the volatile acidity range (y axis).

We can see that red dots are mostly concentrated in top left corner of the plot which signifies lower volatile acidity and higher alcohol.

The distribution of low and average quality wines seems to be concentrated at fixed acidity values between 6 and 10. pH increases as fixed acidity decreases, and citric acid increases as fixed acidity increases.

Now let’s generate a linear model based on above features.

## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = wine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = wine)
## 
## ==========================================
##                        m1         m2      
## ------------------------------------------
##   (Intercept)       6.566***   3.095***   
##                    (0.058)    (0.184)     
##   volatile.acidity -1.761***  -1.384***   
##                    (0.104)    (0.095)     
##   alcohol                      0.314***   
##                               (0.016)     
## ------------------------------------------
##   R-squared            0.153      0.317   
##   adj. R-squared       0.152      0.316   
##   sigma                0.744      0.668   
##   F                  287.444    370.379   
##   p                    0.000      0.000   
##   Log-likelihood   -1794.312  -1621.814   
##   Deviance           883.198    711.796   
##   AIC               3594.624   3251.628   
##   BIC               3610.756   3273.136   
##   N                 1599       1599       
## ==========================================
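To see what the second model implies for a single wine, the rounded m2 coefficients from the table can be plugged into the fitted equation by hand. The helper `predict_quality` is a hypothetical function written for this illustration; with a fitted `lm` object one would normally call `predict()` instead.

```r
# Fitted equation for m2, using the rounded coefficients from the table:
# quality = 3.095 - 1.384 * volatile.acidity + 0.314 * alcohol
predict_quality <- function(volatile.acidity, alcohol) {
  3.095 - 1.384 * volatile.acidity + 0.314 * alcohol
}

# First wine in the dataset: volatile.acidity 0.70, alcohol 9.4, rated 5.
predicted <- predict_quality(0.70, 9.4)
predicted  # about 5.08, close to the observed rating of 5
```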

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid.

Were there any interesting or surprising interactions between features?

Residual sugar, which is supposed to play an important part in wine taste, actually has very little impact on wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created 2 models. Their R-squared values are under 0.4, so they leave most of the variability in quality unexplained.
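The fit statistics in the model table are related by simple formulas, which gives a quick way to read them. The sketch below recomputes the residual standard error (sigma) and adjusted R-squared from the reported deviance, R-squared and N, and the share of variance the better model leaves unexplained. The helper `adj_r2` is written for this illustration.

```r
n <- 1599                 # number of wines (N in the table)

# Residual sum of squares (reported as Deviance) for each model.
deviance_m1 <- 883.198    # one predictor
deviance_m2 <- 711.796    # two predictors

# sigma = sqrt(RSS / residual degrees of freedom), where the residual
# degrees of freedom are n minus the number of fitted coefficients.
sigma_m1 <- sqrt(deviance_m1 / (n - 2))
sigma_m2 <- sqrt(deviance_m2 / (n - 3))

# Adjusted R-squared penalises R-squared for the p predictors used.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(sigma_m1, sigma_m2), 3)
round(c(adj_r2(0.153, n, 1), adj_r2(0.317, n, 2)), 3)

# Share of the variance in quality that m2 does not explain.
unexplained_m2 <- 1 - 0.317
unexplained_m2
```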

Final Plots

Plot 1

Description 1

Most of the wines are rated 5 or 6. Although the rating scale runs from 0 to 10, no wine is rated below 3 or above 8.

Plot 2

Description 2

I observed a positive correlation between quality level and alcohol. People tend to grade stronger wines as high quality, whereas wines with low percent alcohol are often not graded as such. Alcohol is the main carrier of aroma and bouquet, and hence of the flavours of wine. The plot is consistent with this: the higher the alcohol level, the higher the quality level of the wine tends to be.

Plot 3

Description 3

Alcohol and volatile acidity pull quality in opposite directions. Wine with high percent alcohol content and low volatile acidity tends to be rated as high quality wine. Based on this result, volatile acidity and percent alcohol content appear to be two important components in the quality and taste of red wines.

Reflection

The wines data set contains information on 1,599 wines across twelve variables from around 2009. Although there are fewer plots in the submission, I did a lot of visualisation and posted the plots I deemed useful. I had to go through each variable in the dataset, and yes, it is tedious. But it was fun making this notebook. I was stuck at the multivariate analysis, as R is new to me, so I rewatched the Udacity videos and followed some tutorials on the net. There was a trend between the volatile acidity of a wine and its quality. There was also a trend between the alcohol content and quality. There were no wines rated below 3 or above 8, so we could improve the quality of our analysis by collecting more data on wines at those levels, and by creating more variables such as the country where a particular wine was made. We could also include price as a factor and see whether it is related to the quality of wine. This would likely improve the accuracy of the prediction models. Having said that, we have successfully identified features that impact the quality of red wine, visualized their relationships and summarized their statistics.